Instructions
Welcome to the visual presentation of the Cyclistic bike-share company's data analysis project, which walks through the stages of the study in chronological order, from the beginning to the final results.
I am a student in the Data Analytics certificate offered by Coursera in collaboration with Google, and this case study is the capstone at the end of that course. In it I present the project scenario and introduction, then document the work from beginning to end with structured thinking, a plan, and a roadmap. My aim is to present this project in a professional, organized manner, using the methodology I learned throughout the course.
As a junior data analyst, I will derive the insights that serve as the foundation for the marketing team to design a new, targeted strategy. The ultimate goal is to maximize the number of annual memberships by understanding how usage patterns differ between the two customer types. This understanding will enable marketing campaigns specifically aimed at converting existing casual riders into more profitable annual members.
With this information, my team's goal is to design a new marketing strategy to convert casual riders into annual members, guided by the questions we need to answer:

● How do annual members and casual riders use Cyclistic bikes differently?
● Why would casual riders buy Cyclistic annual memberships?
● How can Cyclistic use digital media to influence casual riders to become members?
Case Study Roadmap - Ask

● What is the problem you are trying to solve?
● How can your insights drive business decisions?

Clear Statement of the Business Task

The primary business task is to analyze the historical bike trip data to determine how annual members and casual riders use Cyclistic bikes differently.

● Identify the business task
● Consider key stakeholders
● A clear statement of the business task
● Identify the specific data needed (e.g., historical trip data)
● Determine data credibility and integrity
## Tools used

● Spreadsheet: Microsoft Excel
● Data analysis: SQL (Google BigQuery) and the R programming language (RStudio)
● Visualization: Tableau Public and the R programming language (RStudio)
We will start by using Cyclistic's historical trip data to identify trends. We will use R on the desktop to avoid storage and file-size problems, and download the bicycle trip data for the previous 12 months (2022/05 to 2023/04) from kaggle.com. (Note: The datasets have a different name because Cyclistic is a fictional company. For the purposes of this case study, the datasets are appropriate and will enable you to answer the business questions. The data has been made available by Motivate International Inc. under this license.) This is public data that you can use to explore how different customer types are using Cyclistic bikes. Note, however, that data-privacy issues prohibit you from using riders' personally identifiable information. This means that you won't be able to connect pass purchases to credit card numbers to determine whether casual riders live in the Cyclistic service area or have purchased multiple single passes.
The dataset's usability score is calculated by Kaggle (Version 1, 1.13 GB). The checklist below shows which metadata items are present (check) or missing (close):

check Subtitle, check Tag, check Description, check Cover Image
check Source/Provenance, check Public Notebook, close Update Frequency, check Column Description
check License, check File Format, check File Description
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
files <- list.files(path = data_raw_path, pattern = "*.csv", full.names = TRUE)
cat("Found", length(files), "data files.\n")
## Found 12 data files.
all_trips <- files %>%
map_dfr(function(file) {
cat("Loading:", basename(file), "\n")
read_csv(file, show_col_types = FALSE)
})
## Loading: 202205-divvy-tripdata.csv
## Loading: 202206-divvy-tripdata.csv
## Loading: 202207-divvy-tripdata.csv
## Loading: 202208-divvy-tripdata.csv
## Loading: 202209-divvy-publictripdata.csv
## Loading: 202210-divvy-tripdata.csv
## Loading: 202211-divvy-tripdata.csv
## Loading: 202212-divvy-tripdata.csv
## Loading: 202301-divvy-tripdata.csv
## Loading: 202302-divvy-tripdata.csv
## Loading: 202303-divvy-tripdata.csv
## Loading: 202304_divvy_tripdata.csv
## ✅ All files merged successfully.
## # A tibble: 5 × 13
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 EC2DE40644C6B0F4 classic_bike 2022-05-23 23:06:58 2022-05-23 23:40:19
## 2 1C31AD03897EE385 classic_bike 2022-05-11 08:53:28 2022-05-11 09:31:22
## 3 1542FBEC830415CF classic_bike 2022-05-26 18:36:28 2022-05-26 18:58:18
## 4 6FF59852924528F8 classic_bike 2022-05-10 07:30:07 2022-05-10 07:38:49
## 5 483C52CAAE12E3AC classic_bike 2022-05-10 17:31:56 2022-05-10 17:36:57
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
## 📊 Available columns:
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
if (all(c("started_at", "ended_at") %in% names(all_trips))) {
all_trips <- all_trips %>%
mutate(
started_at = ymd_hms(started_at, quiet = TRUE),
ended_at = ymd_hms(ended_at, quiet = TRUE)
)
cat("🕒 The time columns have been converted to datetime format.\n")
} else {
cat("⚠️ The time columns (started_at / ended_at) do not exist or have different names.\n")
}
## 🕒 The time columns have been converted to datetime format.
output_file <- file.path(data_cleaned_path, "all_trips_raw.csv")
write_csv(all_trips, output_file)
cat("\n💾 The merged file was successfully saved to:\n", output_file, "\n")
##
## 💾 The merged file was successfully saved to:
## C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/cleaned_data/all_trips_raw.csv
# 8. Quick overview of size and structure
##
## 📈 Number of rows: 5859061
## 📊 Number of columns: 13
main_dir <- here::here("C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy")
raw_merged_file <- file.path(main_dir, "data", "cleaned_data", "all_trips_raw.csv")
cleaned_dir <- file.path(main_dir, "data", "cleaned_data")
outputs_dir <- file.path(main_dir, "data", "outputs")
if (!dir.exists(cleaned_dir)) dir.create(cleaned_dir, recursive = TRUE)
if (!dir.exists(outputs_dir)) dir.create(outputs_dir, recursive = TRUE)
cat("📂 Read the merged file from:\n", raw_merged_file, "\n")
## 📂 Read the merged file from:
## C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/cleaned_data/all_trips_raw.csv
We use `read_csv()` with `show_col_types = FALSE` because the file is large.
all_trips <- read_csv(raw_merged_file, show_col_types = FALSE)
cat("✅ File read. Data dimensions: ", dim(all_trips)[1], "rows x", dim(all_trips)[2], "columns\n\n")
## ✅ File read. Data dimensions: 5859061 rows x 13 columns
## 🔎 Quick look at columns:
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
## 📋 Quick summary of some variables (skim):
## ── Data Summary ────────────────────────
## Values
## Name all_trips
## Number of rows 5859061
## Number of columns 13
## _______________________
## Column type frequency:
## character 7
## numeric 4
## POSIXct 2
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 ride_id 0 1 16 16 0 5859061 0
## 2 rideable_type 0 1 11 13 0 3 0
## 3 start_station_name 832009 0.858 3 64 0 1722 0
## 4 start_station_id 832141 0.858 3 36 0 1319 0
## 5 end_station_name 889661 0.848 3 64 0 1741 0
## 6 end_station_id 889802 0.848 3 36 0 1324 0
## 7 member_casual 0 1 6 6 0 2 0
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75
## 1 start_lat 0 1 41.9 0.0458 41.6 41.9 41.9 41.9
## 2 start_lng 0 1 -87.6 0.0287 -87.8 -87.7 -87.6 -87.6
## 3 end_lat 5973 0.999 41.9 0.0672 0 41.9 41.9 41.9
## 4 end_lng 5973 0.999 -87.6 0.106 -88.1 -87.7 -87.6 -87.6
## p100 hist
## 1 42.1 ▁▁▇▇▁
## 2 -87.5 ▁▁▆▇▁
## 3 42.4 ▁▁▁▁▇
## 4 0 ▇▁▁▁▁
##
## ── Variable type: POSIXct ──────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max
## 1 started_at 31 1.000 2022-05-01 00:00:06 2023-04-30 23:59:05
## 2 ended_at 23 1.000 2022-05-01 00:05:17 2023-05-03 10:37:12
## median n_unique
## 1 2022-08-28 12:44:56 4916295
## 2 2022-08-28 13:07:18 4930151
if ("member_casual" %in% names(all_trips)) {
all_trips <- all_trips %>%
mutate(member_casual = tolower(member_casual) %>% str_trim()) %>%
mutate(member_casual = case_when(
member_casual %in% c("member", "registered", "subscriber") ~ "member",
member_casual %in% c("casual", "customer", "guest") ~ "casual",
TRUE ~ member_casual
))
} else {
stop("⚠️ The column 'member_casual' does not exist in the data; check the column names.")
}

parse_safe_datetime <- function(x) {
res <- suppressWarnings(ymd_hms(x, quiet = TRUE))
if (all(is.na(res))) res <- suppressWarnings(ymd_hm(x, quiet = TRUE))
if (all(is.na(res))) res <- suppressWarnings(as_datetime(x))
return(res)
}
if (all(c("started_at", "ended_at") %in% names(all_trips))) {
all_trips <- all_trips %>%
mutate(
start_at = parse_safe_datetime(started_at),
end_at = parse_safe_datetime(ended_at)
)
cat("🕒 Started_at and ended_at have been converted to datetime (multiple attempts).\n")
} else {
stop("⚠️ The started_at or ended_at columns do not exist.")
}
## 🕒 Started_at and ended_at have been converted to datetime (multiple attempts).
if ("ride_id" %in% names(all_trips)) {
before_dup <- nrow(all_trips)
all_trips <- all_trips %>% distinct(ride_id, .keep_all = TRUE)
after_dup <- nrow(all_trips)
cat("🧹 Duplicates removed by ride_id:", before_dup - after_dup, "rows deleted.\n")
} else {
cat("⚠️ No ride_id column; attempting to remove exact duplicate rows instead.\n")
before_dup <- nrow(all_trips)
all_trips <- all_trips %>% distinct()
after_dup <- nrow(all_trips)
cat("🧹 Deleted", before_dup - after_dup, "exact duplicate rows.\n")
}
## 🧹 Duplicates removed by ride_id: 0 rows deleted.
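The filter that follows relies on a `ride_length_min` column, but the step creating it does not appear in this report (the later summary shows both `ride_length_sec` and `ride_length_min`, so it was evidently computed somewhere). A sketch of how it was presumably derived from the parsed timestamps; `add_ride_length` is a hypothetical helper name, and the demo tibble reuses the first two trips from the preview above:

```r
library(dplyr)

# Sketch of the presumably omitted step: derive trip duration (seconds and
# minutes) from the already-parsed POSIXct start/end columns.
add_ride_length <- function(trips) {
  trips %>%
    mutate(
      ride_length_sec = as.numeric(difftime(ended_at, started_at, units = "secs")),
      ride_length_min = ride_length_sec / 60
    )
}

# Tiny illustration using the first two trips shown earlier
demo <- tibble(
  started_at = as.POSIXct(c("2022-05-23 23:06:58", "2022-05-10 07:30:07"), tz = "UTC"),
  ended_at   = as.POSIXct(c("2022-05-23 23:40:19", "2022-05-10 07:38:49"), tz = "UTC")
)
add_ride_length(demo)$ride_length_min  # 33.35 and 8.7 minutes
```

In the report's pipeline this would be applied as `all_trips <- add_ride_length(all_trips)` before the duration filter.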
min_valid_min <- 0.1 # trips under 6 seconds are treated as zero/invalid
max_valid_min <- 1440 # 24 hours is the reasonable maximum
before_filter <- nrow(all_trips)
all_trips <- all_trips %>%
filter(!is.na(ride_length_min)) %>%
filter(ride_length_min > min_valid_min & ride_length_min <= max_valid_min)
after_filter <- nrow(all_trips)
cat("🚫 Invalid trips removed (negative/zero/longer than limit):", before_filter - after_filter, "\n")
## 🚫 Invalid trips removed (negative/zero/longer than limit): 26498
if ("start_station_name" %in% names(all_trips)) {
all_trips <- all_trips %>%
mutate(
start_station_name = ifelse(is.na(start_station_name) | start_station_name == "",
"unknown_start", start_station_name),
end_station_name = ifelse(is.na(end_station_name) | end_station_name == "",
"unknown_end", end_station_name)
)
}

if ("start_station_id" %in% names(all_trips)) {
all_trips <- all_trips %>%
mutate(
start_station_id = ifelse(is.na(start_station_id) | start_station_id == "",
"unknown_start_id", start_station_id),
end_station_id = ifelse(is.na(end_station_id) | end_station_id == "",
"unknown_end_id", end_station_id)
)
}

all_trips <- all_trips %>%
mutate(
date = as_date(started_at),
year = year(started_at),
month = month(started_at, label = TRUE, abbr = TRUE),
day_of_week = wday(started_at, label = TRUE, abbr = TRUE),
day_of_week_full = wday(started_at, label = TRUE, abbr = FALSE),
hour = hour(started_at),
is_weekend = if_else(day_of_week %in% c("Sat", "Sun"), TRUE, FALSE),
season = case_when(
month %in% c("Dec", "Jan", "Feb") ~ "Winter",
month %in% c("Mar", "Apr", "May") ~ "Spring",
month %in% c("Jun", "Jul", "Aug") ~ "Summer",
TRUE ~ "Fall"
)
)
##
## 📊 Final Dimensions: 5832563 rows × 26 columns
##
## 🔍 Number of Missing Values per Column (Top 20):
na_summary <- all_trips %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "column", values_to = "n_missing") %>%
arrange(desc(n_missing))
print(head(na_summary, 20))
## # A tibble: 20 × 2
## column n_missing
## <chr> <int>
## 1 end_lat 759
## 2 end_lng 759
## 3 ride_id 0
## 4 rideable_type 0
## 5 started_at 0
## 6 ended_at 0
## 7 start_station_name 0
## 8 start_station_id 0
## 9 end_station_name 0
## 10 end_station_id 0
## 11 start_lat 0
## 12 start_lng 0
## 13 member_casual 0
## 14 start_at 0
## 15 end_at 0
## 16 ride_length_sec 0
## 17 ride_length_min 0
## 18 date 0
## 19 year 0
## 20 month 0
if ("start_station_name" %in% names(all_trips)) {
cat("\n🏁 Top 10 Start Stations (by trips):\n")
print(all_trips %>% count(start_station_name, sort = TRUE) %>% slice_head(n = 10))
}
##
## 🏁 Top 10 Start Stations (by trips):
## # A tibble: 10 × 2
## start_station_name n
## <chr> <int>
## 1 unknown_start 827888
## 2 Streeter Dr & Grand Ave 74544
## 3 DuSable Lake Shore Dr & Monroe St 41189
## 4 Michigan Ave & Oak St 40182
## 5 DuSable Lake Shore Dr & North Blvd 40061
## 6 Wells St & Concord Ln 38411
## 7 Clark St & Elm St 36263
## 8 Millennium Park 34778
## 9 Kingsbury St & Kinzie St 34538
## 10 Theater on the Lake 33238
##
## 👥 Member vs Casual Distribution:
## # A tibble: 2 × 3
## member_casual n pct
## <chr> <int> <dbl>
## 1 casual 2346869 40.2
## 2 member 3485694 59.8
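The code producing the member vs. casual distribution above is not shown in the report. A minimal sketch of how such a count-plus-percentage table is typically built with dplyr; the `demo` data here is invented purely for illustration:

```r
library(dplyr)

# Invented toy data: 6 member rides and 4 casual rides
demo <- tibble(member_casual = c(rep("member", 6), rep("casual", 4)))

distribution <- demo %>%
  count(member_casual) %>%                  # rides per rider type
  mutate(pct = round(100 * n / sum(n), 1))  # share of all rides
distribution
```

On the real `all_trips` table the same two lines yield the 40.2% casual / 59.8% member split shown above.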
cleaned_csv <- file.path(cleaned_dir, "all_trips_cleaned.csv")
cleaned_rds <- file.path(cleaned_dir, "all_trips_cleaned.rds")
summary_file <- file.path(outputs_dir, "cleaning_summary.csv")
cat("\n💾 Saving cleaned files...\n")
##
## 💾 Saving cleaned files...
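The `write_csv()`/`saveRDS()` calls for the cleaned dataset itself are not shown below (only the summary table is written), although the paths `cleaned_csv` and `cleaned_rds` are defined and reported as saved. Presumably they looked like this sketch; `save_cleaned` is a hypothetical helper wrapping the two calls:

```r
library(readr)

# Hypothetical helper mirroring the report's save step: a portable CSV copy
# (for Tableau/spreadsheets) plus an RDS copy that preserves column types
# and reloads much faster in R.
save_cleaned <- function(trips, csv_path, rds_path) {
  write_csv(trips, csv_path)
  saveRDS(trips, rds_path)
  invisible(c(csv = csv_path, rds = rds_path))
}

# Demonstration with a throwaway data frame and temp files
demo  <- data.frame(ride_id = c("a", "b"), ride_length_min = c(12.5, 8.7))
paths <- save_cleaned(demo, tempfile(fileext = ".csv"), tempfile(fileext = ".rds"))
```

In the report this would be `save_cleaned(all_trips, cleaned_csv, cleaned_rds)`.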
summary_table <- tibble(
metric = c("n_rows", "n_cols", "n_unique_ride_id"),
value = c(nrow(all_trips), ncol(all_trips), ifelse("ride_id" %in% names(all_trips), n_distinct(all_trips$ride_id), NA))
)
write_csv(summary_table, summary_file)
cat("✅ cleaned CSV, RDS and summary file saved to:\n",
cleaned_csv, "\n", cleaned_rds, "\n", summary_file, "\n\n")
## ✅ cleaned CSV, RDS and summary file saved to:
## C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/cleaned_data/all_trips_cleaned.csv
## C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/cleaned_data/all_trips_cleaned.rds
## C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/outputs/cleaning_summary.csv
monthly_trips <- all_trips %>%
group_by(year, month, member_casual) %>%
summarise(ride_count = n(), avg_duration_min = mean(ride_length_min, na.rm = TRUE), .groups = "drop")
write_csv(monthly_trips, file.path(outputs_dir, "monthly_trips_member.csv"))
cat("✅ Useful grouping tables have been saved inside:", outputs_dir, "\n\n")
## ✅ Useful grouping tables have been saved inside: C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/outputs
cat("🎉 Data cleaning is complete! You now have a clean file ready for analysis, Exploratory Analysis (EDA), and Visualization.\n")
## 🎉 Data cleaning is complete! You now have a clean file ready for analysis, Exploratory Analysis (EDA), and Visualization.
## 👉 Saved files:
## - C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/cleaned_data/all_trips_cleaned.csv
## - C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/cleaned_data/all_trips_cleaned.rds
## - C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/outputs/cleaning_summary.csv
cat("\nSuggested next step: Run the Exploratory Analysis script `03_analysis.R` for detailed graphs and statistics.\n")
##
## Suggested next step: Run the Exploratory Analysis script `03_analysis.R` for detailed graphs and statistics.
main_dir <- here::here("C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy")
cleaned_file <- file.path(main_dir, "data", "cleaned_data", "all_trips_cleaned.csv")
visual_dir <- file.path(main_dir, "visualizations")
if (!dir.exists(visual_dir)) dir.create(visual_dir, recursive = TRUE)
cat("📂 Reading cleaned dataset from:", cleaned_file, "\n\n")
## 📂 Reading cleaned dataset from: C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/cleaned_data/all_trips_cleaned.csv
all_trips <- read_csv(cleaned_file, show_col_types = FALSE)
cat("✅ Loaded:", nrow(all_trips), "rows ×", ncol(all_trips), "columns\n\n")
## ✅ Loaded: 5832563 rows × 26 columns
usage_summary <- all_trips %>%
group_by(member_casual) %>%
summarise(
total_trips = n(),
avg_duration_min = mean(ride_length_min, na.rm = TRUE),
median_duration = median(ride_length_min, na.rm = TRUE)
)
print(usage_summary)
## # A tibble: 2 × 4
## member_casual total_trips avg_duration_min median_duration
## <chr> <int> <dbl> <dbl>
## 1 casual 2346869 21.3 12.5
## 2 member 3485694 12.2 8.7
ggplot(usage_summary, aes(x = member_casual, y = avg_duration_min, fill = member_casual)) +
geom_bar(stat = "identity", width = 0.5) +
labs(title = "⏱️ Average Trip Duration by User Type",
x = "User Type", y = "Average Duration (minutes)") +
theme_minimal() +
scale_fill_manual(values = c("#2E86AB", "#E74C3C"))

trips_by_day <- all_trips %>%
group_by(member_casual, day_of_week) %>%
summarise(total_rides = n(), avg_duration = mean(ride_length_min), .groups = "drop")
ggplot(trips_by_day, aes(x = day_of_week, y = total_rides, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "📆 Number of Trips by Day of the Week",
x = "Day", y = "Number of Trips") +
theme_minimal()

trips_by_hour <- all_trips %>%
group_by(member_casual, hour) %>%
summarise(total_rides = n(), .groups = "drop")
ggplot(trips_by_hour, aes(x = hour, y = total_rides, color = member_casual)) +
geom_line(linewidth = 1.1) +
labs(title = "⏰ Trip Distribution by Hour",
x = "Time of Day", y = "Number of Trips") +
theme_minimal()

monthly_usage <- all_trips %>%
group_by(year, month, member_casual) %>%
summarise(total_rides = n(), .groups = "drop")
ggplot(monthly_usage, aes(x = month, y = total_rides, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ year) +
labs(title = "📅 Monthly number of trips by user type",
x = "month", y = "number of trips") +
theme_minimal()

if ("rideable_type" %in% names(all_trips)) {
ride_type_summary <- all_trips %>%
group_by(member_casual, rideable_type) %>%
summarise(total_rides = n(), avg_duration = mean(ride_length_min), .groups = "drop")
ggplot(ride_type_summary, aes(x = rideable_type, y = total_rides, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "🚲 Using Bike Types According to User Type",
x = "Bike Type", y = "Number of Trips") +
theme_minimal()
ggsave(file.path(visual_dir, "ride_type_usage.png"), width = 7, height = 5)
}
##
## 📊 Insights:
## 1️⃣ Members ride mostly on weekdays, while casual riders ride mostly on weekends.
## 2️⃣ The average trip duration for casual riders is much longer than for members.
## 3️⃣ Members often ride for work commutes, while casual riders ride for recreation.
## 4️⃣ Members use e-bikes more often → they prefer speed and efficiency.
insight_summary <- list(
usage_summary = usage_summary,
trips_by_day = trips_by_day,
trips_by_hour = trips_by_hour,
monthly_usage = monthly_usage
)
saveRDS(insight_summary, file.path(visual_dir, "eda_insights_summary.rds"))
cat("\n✅ All results and graphs are saved inside the visualizations folder.\n")
##
## ✅ All results and graphs are saved inside the visualizations folder.
spatial_summary <- all_trips %>%
# 1. Group by starting coordinates and member category
group_by(start_lat, start_lng, member_casual) %>%
# 2. Summarize: count the number of trips per coordinate group
summarize(
ride_count = n(), # n() counts the number of rows in each group
.groups = "drop"
) %>%
# 3. Remove invalid values (optional but recommended for mapping)
na.omit() %>%
# 4. Sort by number of trips in descending order
arrange(desc(ride_count))
head(spatial_summary)

write.csv(spatial_summary, "spatial_summary_for_tableau.csv", row.names = FALSE)
spatial_summary_end <- all_trips %>%
# 1. Group by ending coordinates and member category
group_by(end_lat, end_lng, member_casual) %>%
# 2. Summarize: count the number of trips per coordinate group
summarize(
ride_count = n(), # n() counts the number of rows in each group
.groups = "drop"
) %>%
# 3. Remove invalid values (optional but recommended for mapping)
na.omit() %>%
# 4. Sort by number of rides in descending order
arrange(desc(ride_count))
head(spatial_summary_end)

main_dir <- here::here("C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy")
cleaned_file <- file.path(main_dir, "data", "cleaned_data", "all_trips_cleaned.csv")
outputs_dir <- file.path(main_dir, "data", "outputs")
if (!dir.exists(outputs_dir)) dir.create(outputs_dir, recursive = TRUE)
# load cleaned data (Do not re-clean here)
all_trips <- read_csv(cleaned_file, show_col_types = FALSE)
# 1) Remove rows that do not contain valid coordinates
# or that have illogical zero values
clean_coords <- all_trips %>%
filter(!is.na(start_lat) & !is.na(start_lng) & !is.na(end_lat) & !is.na(end_lng)) %>%
filter(!(start_lat == 0 & start_lng == 0 & end_lat == 0 & end_lng == 0))
# 2) Calculate the distance in meters and then convert it to miles (1 meter = 0.000621371 miles)
# distHaversine expects c(lon,lat)
compute_distance_miles <- function(slng, slat, elng, elat) {
m <- geosphere::distHaversine(c(slng, slat), c(elng, elat)) # meters
miles <- m * 0.000621371
return(miles)
}
# row-wise computation with pmap_dbl (one distHaversine call per trip)
distance_tbl <- clean_coords %>%
transmute(
ride_id,
rideable_type,
member_casual,
ride_length_min,
started_at,
ended_at,
start_lat, start_lng,
end_lat, end_lng
) %>%
mutate(
distance_miles = pmap_dbl(list(start_lng, start_lat, end_lng, end_lat),
~ compute_distance_miles(..1, ..2, ..3, ..4))
)
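A note on performance: the `pmap_dbl()` loop above calls `distHaversine()` once per row, which is slow on ~5.8 million trips. The haversine formula is easy to vectorise directly in base R (and `geosphere::distHaversine()` itself also accepts two-column lon/lat matrices, giving the same speedup). A sketch, using the same Earth radius as geosphere's default; `haversine_miles` is a name introduced here for illustration:

```r
# Vectorised haversine distance in miles (base R); equivalent in spirit to
# geosphere::distHaversine(), whose default radius is also 6378137 m.
haversine_miles <- function(lng1, lat1, lng2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  # pmin guards against floating-point values fractionally above 1
  2 * r * asin(pmin(1, sqrt(a))) * 0.000621371  # metres -> miles
}

# First trip from the preview below: ~3.79 miles
haversine_miles(-87.62676, 41.89147, -87.64895, 41.94367)
```

On the full table this becomes a single vectorised call: `mutate(distance_miles = haversine_miles(start_lng, start_lat, end_lng, end_lat))`.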
# 3) Remove illogical values (optional)
# Example: Any trip exceeding 100 miles within the city is likely a GPS error -> we label or delete it
distance_tbl <- distance_tbl %>%
mutate(distance_miles = round(distance_miles, 3)) %>%
filter(distance_miles <= 100) # You can adjust the limit as needed
# 4) Specify the final table we want to upload to Tableau
tableau_distance <- distance_tbl %>%
select(ride_id, started_at, ended_at, ride_length_min, distance_miles,
member_casual, rideable_type, start_lat, start_lng, end_lat, end_lng)
# 5) Save as one file (if the size is reasonable), otherwise split it into parts
out_file <- file.path(outputs_dir, "trips_with_distance.csv")
readr::write_csv(tableau_distance, out_file)
cat("✅ Saved full CSV to:", out_file, "\n")
## ✅ Saved full CSV to: C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/outputs/trips_with_distance.csv
# 6) -- If you want to split the file into smaller parts when needed (e.g., 1,000,000 rows per file)
chunk_size <- 1000000
n <- nrow(tableau_distance)
if (n > chunk_size) {
cat("⚠️ Large file detected (", n, " rows). Splitting into chunks of ", chunk_size, "...\n")
num_parts <- ceiling(n / chunk_size)
for (i in seq_len(num_parts)) {
start_row <- (i - 1) * chunk_size + 1
end_row <- min(i * chunk_size, n)
part <- tableau_distance[start_row:end_row, ]
out_part <- file.path(outputs_dir, paste0("trips_with_distance_part", sprintf("%02d", i), ".csv"))
readr::write_csv(part, out_part)
cat("Saved:", out_part, "rows:", start_row, "-", end_row, "\n")
}
}
## ⚠️ Large file detected ( 5831796 rows). Splitting into chunks of 1e+06 ...
## Saved: C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/outputs/trips_with_distance_part01.csv rows: 1 - 1e+06
## Saved: C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/outputs/trips_with_distance_part02.csv rows: 1000001 - 2e+06
## Saved: C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/outputs/trips_with_distance_part03.csv rows: 2000001 - 3e+06
## Saved: C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/outputs/trips_with_distance_part04.csv rows: 3000001 - 4e+06
## Saved: C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/outputs/trips_with_distance_part05.csv rows: 4000001 - 5e+06
## Saved: C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/outputs/trips_with_distance_part06.csv rows: 5000001 - 5831796
##
## Preview:
## Rows: 5,831,796
## Columns: 11
## $ ride_id <chr> "EC2DE40644C6B0F4", "1C31AD03897EE385", "1542FBEC83041…
## $ started_at <dttm> 2022-05-23 23:06:58, 2022-05-11 08:53:28, 2022-05-26 …
## $ ended_at <dttm> 2022-05-23 23:40:19, 2022-05-11 09:31:22, 2022-05-26 …
## $ ride_length_min <dbl> 33.350000, 37.900000, 21.833333, 8.700000, 5.016667, 7…
## $ distance_miles <dbl> 3.787, 0.375, 2.391, 0.499, 0.568, 0.797, 0.505, 0.950…
## $ member_casual <chr> "member", "member", "member", "member", "member", "mem…
## $ rideable_type <chr> "classic_bike", "classic_bike", "classic_bike", "class…
## $ start_lat <dbl> 41.89147, 41.88096, 41.88224, 41.88224, 41.88224, 41.8…
## $ start_lng <dbl> -87.62676, -87.61674, -87.64107, -87.64107, -87.64107,…
## $ end_lat <dbl> 41.94367, 41.88635, 41.90765, 41.88458, 41.88578, 41.8…
## $ end_lng <dbl> -87.64895, -87.61752, -87.67255, -87.63189, -87.65102,…
## # A tibble: 5,831,796 × 11
## ride_id started_at ended_at ride_length_min
## <chr> <dttm> <dttm> <dbl>
## 1 EC2DE40644C6B0F4 2022-05-23 23:06:58 2022-05-23 23:40:19 33.4
## 2 1C31AD03897EE385 2022-05-11 08:53:28 2022-05-11 09:31:22 37.9
## 3 1542FBEC830415CF 2022-05-26 18:36:28 2022-05-26 18:58:18 21.8
## 4 6FF59852924528F8 2022-05-10 07:30:07 2022-05-10 07:38:49 8.7
## 5 483C52CAAE12E3AC 2022-05-10 17:31:56 2022-05-10 17:36:57 5.02
## 6 C0A3AA5A614DCE01 2022-05-04 14:48:55 2022-05-04 14:56:04 7.15
## 7 F2AF43A242DF4555 2022-05-27 12:41:48 2022-05-27 12:50:41 8.88
## 8 377BE1F5F0E399CA 2022-05-29 19:19:24 2022-05-29 19:31:34 12.2
## 9 B136E0C969773F59 2022-05-16 17:48:44 2022-05-16 18:05:26 16.7
## 10 75F6A50A05E0AA18 2022-05-11 07:29:29 2022-05-11 07:30:57 1.47
## # ℹ 5,831,786 more rows
## # ℹ 7 more variables: distance_miles <dbl>, member_casual <chr>,
## # rideable_type <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>
##
## ✅ Done. Files ready in: C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/outputs
library(stringr)
# 📁 Folder Path Containing Images
img_dir <- "C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/visualizations/raport sld google"
# 🖼️ Get All Images Starting with the Same Name
imgs <- list.files(img_dir, pattern = "^Cyclistic bik presentation.*\\.png$", full.names = TRUE)
# 🧠 Function to Extract Number from Name (or 0 if None)
extract_number <- function(path) {
fname <- basename(path)
num <- str_extract(fname, "\\((\\d+)\\)")
ifelse(is.na(num), 0, as.numeric(gsub("[()]", "", num)))
}
# 🧮 Sort images by number
imgs <- imgs[order(sapply(imgs, extract_number))]
# 🖼️ Insert images sequentially into the report
for (img in imgs) {
cat(sprintf('![](%s){width=100%%}\n\n', img))
}